Thanks. Dedicated math and coding models are the only reasonable path forward.
Separating out math and coding from general purpose AI models is ESSENTIAL for the future success of AI.
The first few gens of LLMs were horrible at coding and math, as they should have been. Statistical next token prediction is ill-suited to carrying long, precise tasks like code blocks and mathematical proofs all the way through to the accurate final products such tasks require.
Consequently, the age of coding and math overfitting began: model makers ended the training of general purpose AI models with trillions of math and coding tokens to force next token prediction up to the higher accuracy coding and math require, scrambling the weights storing non-overfit tokens in the process. This happened across all model families. For example, after Qwen2.5 72b overfit math, coding, and STEM relative to Qwen2 72b, it was left with far less broad knowledge (e.g. only 10/100 on SimpleQA), making it little more than a hallucination generator on most non-coding/math tasks.
The only tenable solution to this overfitting pandemic is to remove the bulk of coding and math tokens from general purpose AI base models and create dedicated coding and math agents that are deliberately overfit to those domains, allowing them to reach the requisite accuracy on long, precise tasks. We shouldn't even be testing general purpose AI models on anything but a single basic math and coding test (e.g. GSM8K).
We also need a separate test for each programming language. An LLM's high scores in Python mean nothing if I need help in a Dart and Flutter environment.
@testai111 That's a good point. Not only is statistical next token prediction ill-suited for long precise tasks like coding, but the situation is compounded by the large number of diverse programming languages and environments, each requiring unnaturally high precision from LLMs. So even grossly overtraining LLMs on coding tokens is unlikely to produce a sufficiently capable coder across all extant programming languages and environments.
It's you again. Following the fallacy that "SimpleQA reflects how overfit an LLM is", you now claim that "the next token prediction paradigm is not suitable for mathematical and coding tasks".
I'm sure you're playing dumb, just as you have in many previous arguments. You're avoiding the concept of a "reasoning task", right?
Mathematical formulas are very certain, so oh, they're not suitable for LLMs. Code is very certain, so oh, it's not suitable for LLMs.
So is the blank in "Alice's father is Bob, so Alice is Bob's ___." Certain enough?
You didn't mention "reasoning" once, not because you're stupid, but because it's intentional.
@noneUsername Hi again.
Firstly, it's not that math and coding aren't suitable for LLMs; I never used that phrasing. What I'm saying is only that statistical next token prediction has far more difficulty with long, high precision tasks like coding and math, which must ultimately compile or produce the one correct result. That contrasts with tasks like story writing, which are far more forgiving, especially since most readers won't notice most of the errors, such as story contradictions.
And because coding and math require so much more precision, the first few generations of AI models trained on a balanced corpus of knowledge, including coding and math, performed horribly at said tasks. Consequently, model makers started overcompensating by ending training on trillions of natural and synthetic coding & math tokens, and in the process DRASTICALLY reduced the broad knowledge of LLMs relative to their predecessors.
Secondly, I'm obviously not going to publicly disclose my test questions so they don't end up in the training data. But since I run the exact same prompts across all models and can compare the nuanced differences, the weight scrambling becomes overwhelmingly obvious. Qwen2.5 72b reliably gives a lot of the same responses as Qwen2 72b, but with a notable uptick in hallucinations. For example, when asked for the 6 main characters of a popular TV show or movie and the actors who portrayed them, they both start listing the same characters, but Qwen2.5 reliably makes far more mistakes, such as getting a last name wrong or associating the wrong actor with a character. So the weight scrambling is clear as day, and it got even worse going to Qwen3.
But since no rational person should just take my word for it, I reference the next best thing, which is SimpleQA. Other tests like the MMLU only cover a tiny fraction of humanity's most popular domains of knowledge, mostly from STEM and all from academia. Plus the test is multiple choice, so even if a model's weights become notably scrambled and it can no longer accurately recall the correct answer, as real-life use cases require, it can usually still pick the right answer out of a lineup (see the toy sketch below). So the MMLU's very limited coverage and multiple choice design make it a magnet for things like contamination and overfitting, while no publicly available test other than SimpleQA both requires full recall and covers a broad spectrum of questions.
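To make that contrast concrete, here's a toy, hypothetical sketch of the two grading styles. `llm` is just a placeholder for whatever model call you use, and the naive string matching is an assumption for illustration, not how SimpleQA or the MMLU are actually scored.

```python
# A toy contrast between full-recall grading (SimpleQA style) and
# multiple-choice grading (MMLU style). `llm` is a placeholder for whatever
# model call you use; the naive string matching is purely illustrative.

def llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its reply."""
    raise NotImplementedError

def grade_recall(question: str, gold: str) -> bool:
    # The model must produce the fact itself; scrambled weights that swap
    # a name or a date fail this check outright.
    return gold.lower() in llm(question).lower()

def grade_multiple_choice(question: str, options: list[str], gold_index: int) -> bool:
    # The model only has to pick the right answer out of a lineup, which a
    # partially degraded model can often still manage.
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    reply = llm(f"{question}\n{lettered}\nAnswer with a single letter.")
    return reply.strip().upper().startswith(chr(65 + gold_index))
```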
So models that didn't overfit select domains like coding and math, such as Llama 3.1 70b, have the predicted ~20/100 SimpleQA score as noted by OpenAI. That is, there's a very reliable correlation between total parameter count and the SimpleQA score.
Lastly, don't get me started on reasoning. None of the current models are "thinking" through math or coding problems in any sense of the word, or achieving any kind of generalized intelligence. They're just recalling patterns. For example, many of them will just repeat a poem word for word when asked to re-write it. As previously stated, coding and math require numerous high precision steps, so the primary purpose of the thinking tokens is to mitigate stupid mistakes, not just while progressing through the steps, but also when interpreting the user's prompt.
Hi, I'm a noob, just curious:
Would a diffusion-based model work better for long-form text like coding?
@barleyspectacular Hopefully a coder, which I'm not, will chime in and answer your question.
But judging by Google's new diffusion model, it currently performs worse, though in theory it could perform better. This is primarily because earlier errors, such as undeclared variables, can be fixed later. Plus long stories can be more organic since the whole text is edited at once.
However, to be fair, traditional transformer models can easily address this issue by running another thinking block after the output to fix errors, roughly like the sketch below. That is, all skilled human writers and coders plan, create, then edit before producing a final product. By adding that extra editing pass to fix errors like undeclared variables, the transformer model can overcome one of its biggest weaknesses (being pigeonholed by the previously outputted tokens). So I honestly think that text diffusion models will never be as good, let alone better, at anything other than speed.
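For what it's worth, here's a minimal sketch of that draft-then-edit loop, assuming a generic `llm` helper that stands in for whatever chat API you use; the prompt wording and the single revision pass are illustrative, not a claim about how any particular model implements it.

```python
# A minimal sketch of the draft -> edit loop described above. `llm` stands in
# for whatever chat-completion call you actually use; the prompt wording and
# the single revision pass are assumptions for illustration.

def llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its reply."""
    raise NotImplementedError

def draft_then_revise(task: str, revision_passes: int = 1) -> str:
    # First pass: produce the initial code or prose as usual.
    draft = llm(f"Think step by step, then complete this task:\n{task}")

    for _ in range(revision_passes):
        # The extra "thinking block": re-read the finished output and fix the
        # kinds of errors autoregressive decoding locks in, e.g. undeclared
        # variables, mismatched names, contradictions.
        draft = llm(
            "Review the text below for errors (undeclared variables, "
            "inconsistencies, contradictions) and return a corrected version.\n\n"
            + draft
        )
    return draft
```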
@phil111 Well, to be fair, speed can allow for more thinking iterations, or iterations in general, in the same amount of time, so that alone could be an improvement. But I hear you.
The problem I've found with the approach you suggest is that it only works one way because of the limited context length, especially with thinking models. That is, it works in the sense of planning a whole story and then executing each chapter or subsection, roughly like the sketch below. However, cohesiveness is still a huge problem. Even feeding the last few paragraphs of the previous text into the next generation to make it flow yields shaky results.
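For anyone curious what I mean, a rough sketch of that plan-then-execute loop with a rolling tail of context, again with `llm` as a placeholder for your model call and the tail size picked arbitrarily:

```python
# A rough sketch of the plan-then-execute flow with a rolling tail of context.
# `llm` is again a placeholder for your model call; the tail size and prompt
# wording are arbitrary assumptions.

def llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its reply."""
    raise NotImplementedError

def write_long_form(task: str, num_chapters: int, tail_chars: int = 2000) -> str:
    # Plan the whole piece once, while the full task still fits in context.
    outline = llm(f"Write a {num_chapters}-chapter outline for:\n{task}")

    chapters = []
    for i in range(1, num_chapters + 1):
        # Only the outline plus the tail of the previous chapter is carried
        # forward, which is exactly where the cohesiveness problems creep in.
        tail = chapters[-1][-tail_chars:] if chapters else ""
        chapters.append(llm(
            f"Outline:\n{outline}\n\n"
            f"End of previous chapter:\n{tail}\n\n"
            f"Write chapter {i}, staying consistent with both."
        ))
    return "\n\n".join(chapters)
```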